Skip to content

fix(custom/orfnormalise): treat ribotish GenomePos start as 0-based#12173

Open
pinin4fjords wants to merge 3 commits into
nf-core:masterfrom
pinin4fjords:custom/orfnormalise-ribotish-frame
Open

fix(custom/orfnormalise): treat ribotish GenomePos start as 0-based#12173
pinin4fjords wants to merge 3 commits into
nf-core:masterfrom
pinin4fjords:custom/orfnormalise-ribotish-frame

Conversation

@pinin4fjords

Copy link
Copy Markdown
Member

Summary

custom/orfnormalise mis-parses ribotish predict coordinates. ribotish emits 0-based half-open GenomePos (and Start/Stop) coordinates, but _parse_ribotish_genpos subtracted 1 from the start, treating it as 1-based. This shifts every ribotish-derived ORF 5' by one base:

  • + strand: a full reading-frame shift. The normalised BED12 thickStart/blocks no longer start on the ATG, so any downstream codon-aware consumer (e.g. expanding the catalogue into in-frame codon positions for P-site quantification) walks frame 2, and the catalogue AA FASTA translates with premature in-frame stops.
  • − strand: a 1-nt 3' overhang (ORF span no longer a whole number of codons).

Verification

Against the chr20 reference for a worked ribotish ORF (GenomePos=20:326098-326872:+, AALen=257):

  • corrected 0-based start 326098 → codon ATG, clean translation to 257 aa then stop at codon 258 (matches AALen);
  • the old -1 start 326097 → codon CAT, premature stop at codon 72.

After the fix, all five caller fixtures (ribocode, ribotish, ribotricer, rpbp, price) emit ORFs whose blockSizes sum is a multiple of 3, and every chr20 ORF starts on ATG with no internal stop codons. Only ribotish was affected; the other callers were already correct.

Changes

  • Fix: _parse_ribotish_genpos uses the GenomePos start as-is (the end was already treated as 0-based-exclusive).
  • Test data: migrate the caller fixtures off the deleted pinin4fjords/test-datasets@add-orf-prediction-fixtures branch to params.modules_testdata_base_path (nf-core/test-datasets). The test could not run as-is (dead URLs → 404).
  • Regression guard: each caller test now asserts every emitted ORF spans a whole number of codons (sum(blockSizes) % 3 == 0), which catches this class of coordinate/frame bug for all callers.
  • Snapshots: regenerate custom/orfnormalise and the dependent orftable_fasta_gtf_buildorfcatalogue subworkflow snapshots. Only the ribotish-derived records change.

PR checklist

  • Tests pass (nf-test test modules/nf-core/custom/orfnormalise/... subworkflows/nf-core/orftable_fasta_gtf_buildorfcatalogue/..., NXF 25.04.8, --profile docker).
  • Snapshots regenerated.

🤖 Generated with Claude Code

ribotish `predict` emits 0-based half-open `GenomePos` coordinates, but the
parser subtracted 1 from the start, treating it as 1-based. That shifted every
ribotish-derived ORF 5' by one base: a reading-frame shift on + strand records
(premature in-frame stops, and frame-2 codon positions in any downstream
codon-aware step) and a 3' overhang on - strand. Verified against the genome -
the corrected start lands on the ATG and translates to a clean full-length ORF.

Also:
- migrate the test fixtures off the deleted `pinin4fjords/test-datasets`
  `add-orf-prediction-fixtures` branch to `modules_testdata_base_path`
  (nf-core/test-datasets), which the test now depends on to run at all.
- add a codon-boundary regression guard asserting every emitted ORF spans a
  whole number of codons, which catches this class of coordinate bug for all
  callers.
- regenerate the orfnormalise and orftable_fasta_gtf_buildorfcatalogue
  snapshots (only the ribotish-derived records change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a coordinate-parsing bug in custom/orfnormalise for ribotish predict output by treating GenomePos starts as 0-based (half-open) instead of incorrectly converting them as if they were 1-based. This aligns ribotish-derived ORFs with correct frame/codon boundaries and updates tests/snapshots accordingly.

Changes:

  • Correct _parse_ribotish_genpos to keep the ribotish GenomePos start coordinate as-is (0-based half-open).
  • Update module tests to fetch fixtures via params.modules_testdata_base_path and add a codon-span regression guard (sum(blockSizes) % 3 == 0) across caller fixtures.
  • Regenerate nf-test snapshots for custom/orfnormalise and the dependent subworkflow.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
modules/nf-core/custom/orfnormalise/templates/orfnormalise.py Fix ribotish GenomePos start coordinate handling to be 0-based half-open.
modules/nf-core/custom/orfnormalise/tests/main.nf.test Switch fixtures to shared test-datasets base path and add codon-span regression assertions.
modules/nf-core/custom/orfnormalise/tests/main.nf.test.snap Update expected outputs for ribotish scenario and metadata ordering/timestamps.
subworkflows/nf-core/orftable_fasta_gtf_buildorfcatalogue/tests/main.nf.test.snap Update expected catalogue outputs/snapshots reflecting corrected ribotish-derived records.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread modules/nf-core/custom/orfnormalise/tests/main.nf.test
Comment thread modules/nf-core/custom/orfnormalise/tests/main.nf.test

@suhrig suhrig left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for spotting/fixing this!

The %3 assertion is a great idea!

continue
if b_i < a_i:
a_i, b_i = b_i, a_i
out.append((a_i - 1, b_i))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to omit - 1 here as well? I haven't confirmed empirically, but I'm assuming the blocks column of Ribotish is 0-based, too.

Caution: This function is also used for parsing of Ribotricer output, which may be 1-based.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, and you're right to be wary of parse_intervals being shared.

Checked it empirically: ribotish predict emits 19 columns (Gid, Tid, Symbol, GeneType, GenomePos, StartCodon, Start, Stop, TisType, TISGroup, TISCounts, TISPvalue, RiboPvalue, RiboPStatus, FisherPvalue, TISQvalue, FrameQvalue, FisherQvalue, AALen) and no Blocks column, so parse_intervals(row.get("Blocks", ...)) never actually fired - exon structure is always recovered from the GenomePos span intersected with the transcript exons. A 4-exon ribotish ORF in a full pipeline run reconstructs and translates cleanly through that fallback after the fix, so the GenomePos correction is the whole story for ribotish. I've dropped the dead Blocks branch so ribotish always uses that verified path (output unchanged, snapshots still match).

And yes, parse_intervals stays 1-based on purpose: ribotricer is 1-based (its fixtures translate to clean ATG-start, no-internal-stop, len % 3 == 0 ORFs through that -1), so removing it there would have broken ribotricer. Leaving it alone was deliberate.

pinin4fjords and others added 2 commits June 30, 2026 15:03
ribotish `predict` reports a single genomic span (GenomePos) and no per-exon
`Blocks` column, so `parse_intervals(row.get("Blocks", ...))` never fired -
exon structure is always recovered from the GenomePos span intersected with
the transcript's exons. Remove the dead branch so ribotish always uses that
verified path. This also removes a latent foot-gun: `parse_intervals` is
1-based (correct for ribotricer), so had ribotish ever emitted a 0-based
`Blocks` column it would have reintroduced the off-by-one this PR fixes.

Output is unchanged (the branch was dead); snapshots match without regen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pinin4fjords pinin4fjords requested a review from suhrig June 30, 2026 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants